33 research outputs found

    Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources

    [EN] In recent years, deep learning has fundamentally changed the landscape of a number of areas in artificial intelligence, including computer vision, natural language processing, robotics, and game theory. In particular, the striking success of deep learning in a large variety of natural language processing (NLP) applications, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), has resulted in major accuracy improvements, thus widening the applicability of these technologies in real-life settings. At this point, it is clear that ASR and MT technologies can be used to produce cost-effective, high-quality multilingual subtitles for video content of different kinds. This is particularly true of the transcription and translation of video lectures and other kinds of educational materials, in which the audio recording conditions are usually favorable for the ASR task and the speech is grammatically well-formed. However, although state-of-the-art neural approaches to TTS have been shown to drastically improve the naturalness and quality of synthetic speech over conventional concatenative and parametric systems, it is still unclear whether this technology is already mature enough to improve accessibility and engagement in online learning, particularly in the context of higher education. Furthermore, advanced topics in TTS such as cross-lingual voice cloning, incremental TTS and zero-shot speaker adaptation remain open challenges in the field. This thesis is about enhancing the performance and widening the applicability of modern neural TTS technologies in real-life settings, in both offline and streaming conditions, in the context of improving accessibility and engagement in online learning. Particular emphasis is placed on speaker adaptation and cross-lingual voice cloning, since in this context the input text is a translation of an utterance originally spoken in another language.
    Pérez González De Martos, AM. (2022). Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/184019
    TESIS; Premios Extraordinarios de tesis doctorale

    Integrating a State-of-the-Art ASR System into the Opencast Matterhorn Platform

    [EN] In this paper we present the integration of a state-of-the-art ASR system into the Opencast Matterhorn platform, a free, open-source platform to support the management of educational audio and video content. The ASR system was trained on a novel large speech corpus, known as poliMedia, that was manually transcribed for the European project transLectures. This novel corpus contains more than 115 hours of transcribed speech that will be available to the research community. Initial results on the poliMedia corpus are also reported to compare the performance of different ASR systems based on the linear interpolation of language models. To this end, the in-domain poliMedia corpus was linearly interpolated with an external large-vocabulary dataset, the well-known Google N-Gram corpus. The reported WER figures show a notable improvement over the baseline as a result of incorporating the vast amount of data in the Google N-Gram corpus.
    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Also supported by the Spanish Government (MIPRCV "Consolider Ingenio 2010" and iTrans2 TIN2009-14511) and the Generalitat Valenciana (Prometeo/2009/014).
    Valor Miró, JD.; Pérez González De Martos, AM.; Civera Saiz, J.; Juan Císcar, A. (2012). Integrating a State-of-the-Art ASR System into the Opencast Matterhorn Platform. Communications in Computer and Information Science. 328:237-246. https://doi.org/10.1007/978-3-642-35292-8_25
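    As a rough illustration of the linear interpolation of language models mentioned in this abstract, the sketch below mixes an in-domain and an out-of-domain unigram distribution and picks the mixing weight that minimises perplexity on held-out text. The toy probabilities, the `interpolate` and `perplexity` helpers, and the grid search over the weight are all assumptions made for illustration; the models, tooling and weights actually used in the paper are not stated in this listing.

```python
import math


def interpolate(p_in, p_out, lam):
    """Linear interpolation of two word distributions: lam*p_in + (1-lam)*p_out."""
    vocab = set(p_in) | set(p_out)
    return {w: lam * p_in.get(w, 0.0) + (1.0 - lam) * p_out.get(w, 0.0)
            for w in vocab}


def perplexity(model, words):
    """Perplexity of a unigram model on a word list (every word needs nonzero probability)."""
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))


# Hypothetical toy models: a small in-domain one and a broader out-of-domain one.
p_in = {"lecture": 0.5, "speech": 0.4, "the": 0.1}
p_out = {"lecture": 0.1, "speech": 0.1, "the": 0.6, "book": 0.2}

# Hypothetical held-out text used only to tune the interpolation weight.
held_out = ["the", "lecture", "the", "speech", "book"]

# Simple grid search over the weight given to the in-domain model.
candidates = [i / 10 for i in range(1, 10)]
best_lam = min(candidates,
               key=lambda lam: perplexity(interpolate(p_in, p_out, lam), held_out))
mixed = interpolate(p_in, p_out, best_lam)

print(f"best weight: {best_lam}, perplexity: {perplexity(mixed, held_out):.3f}")
```

    Because both inputs are proper distributions, the mixture also sums to one for any weight between 0 and 1, which is why the weight can be tuned freely on held-out data.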

    A System Architecture to Support Cost-Effective Transcription and Translation of Large Video Lecture Repositories

    [EN] Online video lecture repositories are rapidly growing and becoming established as fundamental knowledge assets. However, most lectures are neither transcribed nor translated because of the lack of cost-effective solutions that give sufficiently accurate results. In this paper, we describe a system architecture that supports the cost-effective transcription and translation of large video lecture repositories. This architecture has been adopted in the EU project transLectures and is now being tested on a repository of more than 9000 video lectures at the Universitat Politècnica de València. Following a brief description of this repository and of the transLectures project, we describe the proposed system architecture in detail. We also report empirical results on the quality of the transcriptions and translations currently being maintained and steadily improved.
    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Funding was also provided by the Spanish Government with the FPU scholarship AP2010-4349.
    Silvestre Cerdà, JA.; Pérez González De Martos, AM.; Jiménez López, M.; Turró Ribalta, C.; Juan Císcar, A.; Civera Saiz, J. (2013). A System Architecture to Support Cost-Effective Transcription and Translation of Large Video Lecture Repositories. IEEE International Conference on Systems, Man, and Cybernetics. Conference proceedings. 3994-3999. https://doi.org/10.1109/SMC.2013.682

    Using automatic speech transcriptions in lecture recommendation systems

    One problem created by the success of video lecture repositories is the difficulty individual users face in choosing the most suitable video for their learning needs from among the vast numbers available on a given site. Recommender systems have become extremely common in recent years and are used in many areas. In the particular case of video lectures, automatic speech transcriptions can be used to zoom in on user interests at a semantic level, thereby improving the quality of the recommendations made. In this paper, we describe a video lecture recommender system that uses automatic speech transcriptions, alongside other relevant text resources, to generate semantic lecture and user models. In addition, we present a real-life implementation of this system for the VideoLectures.NET repository.
    The research leading to these results has received funding from the PASCAL2 Network of Excellence under the PASCAL Harvest Project La Vie, the EU 7th Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures), the ICT Policy Support Programme (ICT PSP/2007-2013) as part of the Competitiveness and Innovation Framework Programme (CIP) under grant agreement no. 621030 (EMMA), the Spanish MINECO Active2Trans (TIN2012-31723) research project, and by the Spanish Government with the FPU scholarship AP2010-4349.
    Pérez González De Martos, AM.; Silvestre Cerdà, JA.; Rihtar, M.; Juan Císcar, A.; Civera Saiz, J. (2014). Using automatic speech transcriptions in lecture recommendation systems. In Conference Proceedings iberSPEECH 2014 : VIII Jornadas en Tecnologías del Habla and IV SLTech Workshop. Universidad de Las Palmas de Gran Canaria. 149-158. http://hdl.handle.net/10251/54395

    Hacia la traducción integral de vídeo charlas educativas

    [EN] More and more universities and educational institutions are banking on the production of technological resources for different uses in higher education. The MLLP research group has been working closely with the ASIC at UPV for years in order to enrich educational multimedia resources through the use of machine learning technologies, such as automatic speech recognition, machine translation and text-to-speech synthesis. In this work, developed within the framework of the Plan de Docencia en Red 2016-17, we present the application of these technologies to achieve the integral translation of educational videos.
    The research presented here has received funding from the European FP7/2007-2013 programme under grant agreement no. 287755 (transLectures) and from the ICT PSP/2007-2013 as part of the Competitiveness and Innovation Framework Programme (CIP) under grant agreement no. 621030 (EMMA), as well as from the national research project TIN2015-68326-R (MINECO/FEDER) (MORE) and the Generalitat Valenciana VALi+d scholarship ACIF/2015/082.
    Piqueras, S.; Pérez González De Martos, AM.; Turró Ribalta, C.; Jimenez, M.; Sanchis Navarro, JA.; Civera Saiz, J.; Juan Císcar, A. (2017). Hacia la traducción integral de vídeo charlas educativas. In In-Red 2017. III Congreso Nacional de innovación educativa y de docencia en red. Editorial Universitat Politècnica de València. 117-124. https://doi.org/10.4995/INRED2017.2017.6812

    MLLP Transcription and Translation Platform

    This paper briefly presents the main features of MLLP's Transcription and Translation Platform, which uses state-of-the-art automatic speech recognition and machine translation systems to generate multilingual subtitles of educational audiovisual and textual content. It has been shown to reduce user effort up to 1/3 of the time needed to generate transcriptions and translations from scratch.
    Pérez González De Martos, AM.; Silvestre Cerdà, JA.; Valor Miró, JD.; Civera Saiz, J.; Juan Císcar, A. (2015). MLLP Transcription and Translation Platform. Springer. http://hdl.handle.net/10251/65747

    MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension

    [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration to the primary system with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.
    The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 761758 (X5Gon) and 952215 (TAILOR), and Erasmus+ Education programme under grant agreement no. 20-226-093604-SCH (EXPERT); the Government of Spain's grant RTI2018-094879-B-I00 (Multisub) funded by MCIN/AEI/10.13039/501100011033 & "ERDF A way of making Europe", and FPU scholarships FPU14/03981 and FPU18/04135; the Generalitat Valenciana's research project Classroom Activity Recognition (ref. PROMETEO/2019/111), and predoctoral research scholarship ACIF/2017/055; and the Universitat Politècnica de València's PAID-01-17 R&D support programme.
    Baquero-Arnal, P.; Jorge-Cano, J.; Giménez Pastor, A.; Iranzo-Sánchez, J.; Pérez-González De Martos, AM.; Garcés Díaz-Munío, G.; Silvestre Cerdà, JA.... (2022). MLLP-VRAIN Spanish ASR Systems for the Albayzín-RTVE 2020 Speech-to-Text Challenge: Extension. Applied Sciences. 12(2):1-14. https://doi.org/10.3390/app12020804
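    The "6% relative" degradation quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only a worked example using the two WER figures reported for the test-2020 set; the helper name `relative_degradation` is an assumption made for illustration and does not come from the paper.

```python
def relative_degradation(baseline_wer, new_wer):
    """Relative WER change in percent, measured against the baseline system."""
    return 100.0 * (new_wer - baseline_wer) / baseline_wer


primary_wer = 16.0      # % WER, primary system, 1.5 s context window
contrastive_wer = 16.9  # % WER, contrastive system, 0.6 s context window

rel = relative_degradation(primary_wer, contrastive_wer)
print(f"{rel:.1f}% relative degradation")
```

    The absolute gap is 0.9 WER points, which is about 5.6% of the 16.0% baseline and rounds to the 6% relative figure given in the abstract.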

    Evaluación del proceso de revisión de transcripciones automáticas para vídeos Polimedia

    [EN] Video lectures are a tool of proven value and wide acceptance in universities, and have given rise to platforms like poliMedia. transLectures is a European project that generates high-quality automatic transcriptions and translations for the poliMedia platform and improves them using massive adaptation and intelligent interaction techniques. In this paper we present the evaluation with lecturers carried out under the Docència en Xarxa 2012-2013 call, with the aim of studying the process of supervising automatic transcriptions and comparing it with transcribing from scratch, without a prior automatic transcription.
    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755.
    Valor Miró, JD.; Nadine Spencer, R.; Pérez González De Martos, AM.; Garcés Díaz-Munío, GV.; Turró Ribalta, C.; Civera Saiz, J.; Juan, A. (2014). Evaluación del proceso de revisión de transcripciones automáticas para vídeos Polimedia. In Jornadas de Innovación Educativa y Docencia en Red de la Universitat Politècnica de València. Editorial Universitat Politècnica de València. 272-278. http://hdl.handle.net/10251/54397

    Doblaje automático de vídeo-charlas educativas en UPV[Media]

    [EN] More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over recent years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources, and particularly their accessibility and language offering, through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. In this work we present the steps that are being taken towards the comprehensive translation of these materials, specifically through (semi-)automatic dubbing using state-of-the-art speaker-adaptive text-to-speech technologies.
    This work has received funding from the Government of Spain through grant RTI2018-094879-B-I00 funded by MCIN/AEI/10.13039/501100011033 (Multisub) and by "ERDF A way of making Europe"; from the Erasmus+ Education programme under grant agreement 20-226-093604-SCH (EXPERT); and from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 761758 (X5gon).
    Pérez González De Martos, AM.; Giménez Pastor, A.; Jorge Cano, J.; Iranzo Sánchez, J.; Silvestre Cerdà, JA.; Garcés Díaz-Munío, GV.; Baquero Arnal, P.... (2023). Doblaje automático de vídeo-charlas educativas en UPV[Media]. In In-Red 2022 - VIII Congreso Nacional de Innovación Educativa y Docencia en Red. Editorial Universitat Politècnica de València. https://doi.org/10.4995/INRED2022.2022.1584

    TransLectures

    transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMedia. The first results obtained by the UPV group for the poliMedia repository will also be provided.
    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Funding was also provided by the Spanish Government (iTrans2 project, TIN2009-14511; FPI scholarship BES-2010-033005; FPU scholarship AP2010-4349).
    Silvestre Cerdà, JA.; Del Agua Teba, MA.; Garcés Díaz-Munío, GV.; Gascó Mora, G.; Giménez Pastor, A.; Martínez-Villaronga, AA.; Pérez González De Martos, AM.... (2012). TransLectures. IberSPEECH 2012. 345-351. http://hdl.handle.net/10251/37290